YouTube-TEDx

Group 2

2024-04-22

Recap of Analytics Plan

Descriptive Statistics: Analyzed engagement statistics including views, likes, and comments.
Performance and Engagement Metrics: Compared engagement metrics across categories, video duration, and tags.
Content Analysis: Utilized topic modeling and keyword extraction to identify themes within the content.
Trend Analysis: Examined trends in viewership and engagement over time.

Tools Used:
- Data Extraction and Parsing: tuber, httr, jsonlite, tidyverse, skimr, recipes, dplyr, tidyr, lubridate
- Machine Learning: h2o
- Visualization: plotly, ggplot2

Peer Comments Summary

Data Summary

- The data was extracted from the YouTube Data API using the tuber R package.
- The dataset offers a comprehensive collection of TEDx talks from the TEDx YouTube channel, featuring talks aimed at inspiring, educating, and sparking discussions on various important subjects.
- Each entry includes details such as the video ID, publication time, title, description, tags, category ID, default audio language, duration, dimension, caption availability, licensed content status, view count, like count, favorite count, and comment count.
- The dataset offers insights into the content and engagement metrics of these TEDx talk videos, showcasing diverse topics and audience responses.

## Rows: 20,268
## Columns: 12
## $ Utc_Day_Part           <chr> "Afternoon", "Afternoon", "Afternoon", "Afterno…
## $ Month                  <chr> "March", "March", "March", "March", "March", "M…
## $ Day_Of_Week            <chr> "Tuesday", "Tuesday", "Tuesday", "Tuesday", "Tu…
## $ Title                  <chr> "The Great Diffusion | Alex Lazarow | TEDxSonom…
## $ Description            <chr> "Over the last 150 years, unprecedented technol…
## $ Tags                   <chr> "Business,Economics,English,Entrepreneurship,Fu…
## $ Duration_Minutes       <dbl> 10, 12, 11, 16, 16, 7, 6, 10, 12, 10, NA, 11, 1…
## $ Default_Audio_Language <chr> "en", "en", "en", "en", "en", "en", "pl", "pl",…
## $ Caption                <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE,…
## $ View_Count             <dbl> 77, 71, 313, 62, 180, 347, 119, 179, 90, 27419,…
## $ Like_Count             <dbl> 3, 2, 13, 0, 10, 17, 4, 4, 3, 72, 1500, 18, 12,…
## $ Comment_Count          <dbl> 0, 0, 4, 0, 0, 15, 1, 1, 0, 40, 49, 3, 0, 0, 5,…

Data Wrangling


- Categorized “Published_Time” into UTC day parts: Night, Morning, Afternoon, Evening
- Extracted day of the week from “Published_Time”
- Extracted month from “Published_Time”
- Extracted minutes from “Duration”
- Pre-processed “Tags” column by removing unnecessary keywords
- Removed low variance columns
- Excluded videos from the last 3 weeks
- Calculated date 3 weeks before max date
- Filtered out rows within last 30 days
- Separated data for text and non-text models
- Factorized categorical variables using recipe and bake
- Saved the post-processed data as an .rds file
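
The date and duration steps in the list above can be sketched as follows. This is a minimal illustration, not the group's actual script. The sample rows are made up, and the hour boundaries for the day parts (6, 12, 18 UTC) and the rounding of durations down to whole minutes are assumptions. The slide mentions both a 3-week and a 30-day exclusion window; the sketch assumes 3 weeks.

```r
library(dplyr)
library(lubridate)

# Hypothetical raw rows: column names follow the slide, values are invented.
raw <- tibble(
  Published_Time = c("2024-03-05T14:30:00Z", "2024-03-09T02:10:00Z",
                     "2024-04-20T19:45:00Z"),
  Duration = c("PT10M32S", "PT1H2M5S", "PT7M")
)

# English day names indexed by wday() with Sunday = 1 (locale independent).
day_names <- c("Sunday", "Monday", "Tuesday", "Wednesday",
               "Thursday", "Friday", "Saturday")

wrangled <- raw %>%
  mutate(
    published = ymd_hms(Published_Time, tz = "UTC"),
    hr = hour(published),
    # Assumed boundaries: Night 0-5, Morning 6-11, Afternoon 12-17, Evening 18-23
    Utc_Day_Part = case_when(
      hr < 6  ~ "Night",
      hr < 12 ~ "Morning",
      hr < 18 ~ "Afternoon",
      TRUE    ~ "Evening"
    ),
    Day_Of_Week = day_names[wday(published, week_start = 7)],
    Month = month.name[month(published)],
    # ISO 8601 duration string -> seconds -> whole minutes (rounded down)
    Duration_Minutes = floor(as.numeric(duration(Duration)) / 60)
  ) %>%
  # Drop videos published in the 3 weeks before the newest video,
  # since their engagement counts have had little time to accumulate.
  filter(published <= max(published) - weeks(3)) %>%
  select(Utc_Day_Part, Month, Day_Of_Week, Duration_Minutes)

print(wrangled)
```

With the sample rows, the newest video sets the cutoff and is itself excluded, leaving the two March videos.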

Data Exploration

Show example data and if your ML models involve structured data, demonstrate that you consider at least 20 predictors

What tags should I use for my content?

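library(ggplot2)

# Per-topic term probabilities (beta), one panel per topic
ggplot(
  word_probs,
  aes(term, beta, fill = as.factor(topic))
) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  coord_flip()  # horizontal bars keep the terms readable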
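
Since the plan also compares engagement across tags, a simple per-tag summary can sit alongside the topic plot. The sketch below is illustrative only: the `videos` rows are invented, and the use of `tidyr::separate_rows()` and the choice of median views as the engagement measure are assumptions, not the group's stated method.

```r
library(dplyr)
library(tidyr)

# Hypothetical sample: Tags is a comma-separated string, as in the dataset.
videos <- tibble(
  Tags = c("Business,Economics,English", "Education,English", "Business,Education"),
  View_Count = c(100, 300, 500)
)

tag_stats <- videos %>%
  separate_rows(Tags, sep = ",") %>%   # one row per (video, tag) pair
  mutate(Tags = trimws(Tags)) %>%      # guard against stray spaces
  group_by(Tags) %>%
  summarise(
    n_videos = n(),
    median_views = median(View_Count),
    .groups = "drop"
  ) %>%
  arrange(desc(n_videos), Tags)

print(tag_stats)
```

The median is used rather than the mean because view counts are typically heavily skewed by a few viral videos.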

When should I upload my video to boost engagement?

plot(dl)

What should be the duration length for my video?

plot(gbm)

XAI

Deep Learning XAI CPP

plot(h2o_exp_dl_cp, variables = c("View_Count", "Duration_Minutes")) +
  ggtitle("View_Count")

GBM XAI SHAP

plot(h2o_exp_gbm_shap) + ggtitle("SHAP explanation")

Key Takeaways